library(MASS)
require(jpeg)
## Loading required package: jpeg
require(softImpute)
## Loading required package: softImpute
## Loading required package: Matrix
## Loaded softImpute 1.4
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
library(ISLR)
PCA creates a new set of features (a new coordinate system) which are each a linear combination of the original features. The goal here is to (i) capture most of the “structure” in the data with a relatively low-dimensional linear subspace and (ii) potentially use the new low-D feature space as a remedy for the curse of dimensionality.
In which direction(s) does the data vary the most? Can we create a new coordinate system that emphasizes those directions instead of the original measured features?
set.seed(123)
x <- mvrnorm(n=100, mu=c(3,3), Sigma=matrix(c(.2,.1,.1,.2), byrow=TRUE, nrow=2)) # covariance matrix must be symmetric
plot(x, xlim=c(0,5),ylim=c(0,5))
abline(0,1,col='red')
We can also think of the first principal component as the one-dimensional linear subspace that is “closest” to the data, quite similar to regression in that sense.
Terminology: * Loadings * Scores * Proportion of Variance explained * Scree plot * biplot (this is an R function)
For a column-centered data matrix \(X\), the covariance matrix is \(C = \frac{1}{n-1} X^{\textrm{T}}X\). The eigenvectors of the covariance matrix are the principal component directions because they point in the directions of strongest variance in the data.
x <- mvrnorm(n=100, mu=c(3,3), Sigma=matrix(c(.2,.1,.1,.2), byrow=TRUE, nrow=2)) # covariance matrix must be symmetric
x_centered = x - c(3,3) # center at the known mean; in practice, subtract colMeans(x)
covariance_mat <- (1/99) * t(x_centered) %*% x_centered
eig <- eigen(covariance_mat)
eig$vectors
## [,1] [,2]
## [1,] -0.7417674 0.6706572
## [2,] -0.6706572 -0.7417674
plot(x_centered)
abline(0, eig$vectors[2,1]/eig$vectors[1,1]) # slope of the first eigenvector direction
PCA can also be understood as the singular value decomposition of the data matrix X. Don’t worry if you don’t recall what SVD is, just understand it to be a generalization of eigen-decomposition to non-square matrices.
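To make the connection concrete: for a centered data matrix, the right singular vectors of \(X\) coincide (up to sign) with the eigenvectors of the covariance matrix, and the squared singular values divided by \(n-1\) give the eigenvalues. A quick check with the simulated data above (`x_centered` and `eig` as defined earlier):

```r
# SVD of the centered data matrix: X = U D V^T
x.svd <- svd(x_centered)
# columns of V match the eigenvectors of the covariance matrix, up to sign
x.svd$v
# squared singular values / (n-1) match the eigenvalues
x.svd$d^2 / (nrow(x_centered) - 1)
eig$values
```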
Let’s return to the R functions and gain some more intuition about principal components, loadings, and scores.
Load iris data and make pairs plot.
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris0 = iris[,-5]
pairs(iris0,col = rep(2:4, each = 50), lwd = 3)
Compute principal components and make biplot.
iris.pca = prcomp(iris0,scale=F)
biplot(iris.pca, cex = .6)
Check what has been computed.
names(iris.pca)
## [1] "sdev" "rotation" "center" "scale" "x"
iris.pca$sdev
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
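The proportion of variance explained (PVE) by each component follows directly from `sdev`, and a scree plot is just a plot of these values; this mirrors what `summary(iris.pca)` reports:

```r
# proportion of variance explained by each principal component
pve <- iris.pca$sdev^2 / sum(iris.pca$sdev^2)
round(pve, 4)
# scree plot: PVE against component number
plot(pve, type = "b", xlab = "Principal component", ylab = "Proportion of variance explained")
```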
iris.pca$rotation
## PC1 PC2 PC3 PC4
## Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
## Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
## Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574
The rotation matrix is orthogonal: it satisfies \(R^{\textrm{T}}R = RR^{\textrm{T}} = I\), the identity matrix.
R <- iris.pca$rotation
t(R)%*%R
## PC1 PC2 PC3 PC4
## PC1 1.000000e+00 -1.630640e-16 -8.326673e-17 5.551115e-17
## PC2 -1.630640e-16 1.000000e+00 6.938894e-17 6.245005e-17
## PC3 -8.326673e-17 6.938894e-17 1.000000e+00 -5.551115e-17
## PC4 5.551115e-17 6.245005e-17 -5.551115e-17 1.000000e+00
R%*%t(R)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.000000e+00 6.245005e-16 2.498002e-16 2.775558e-16
## Sepal.Width 6.245005e-16 1.000000e+00 0.000000e+00 -2.220446e-16
## Petal.Length 2.498002e-16 0.000000e+00 1.000000e+00 1.110223e-16
## Petal.Width 2.775558e-16 -2.220446e-16 1.110223e-16 1.000000e+00
The first principal component loads most heavily on petal length. That is related to the fact that this variable has the largest variance (see the diagonal entries of the covariance matrix below).
var(iris0)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
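A useful sanity check: the component variances (the squared `sdev` values) must sum to the total variance, i.e. the trace of this covariance matrix.

```r
# sum of PC variances equals the trace of the covariance matrix
sum(iris.pca$sdev^2)
sum(diag(var(iris0)))  # same value
```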
Here are the other quantities that are computed by prcomp().
iris.pca$center
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
iris.pca$scale
## [1] FALSE
head(iris.pca$x)
## PC1 PC2 PC3 PC4
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
The score matrix iris.pca$x is obtained by subtracting the column means in iris.pca$center from each column of the data and multiplying the result from the right by the rotation matrix in iris.pca$rotation. Check this for the first row:
as.matrix(iris0[1,]-iris.pca$center)%*%as.matrix(iris.pca$rotation) - iris.pca$x[1,]
## PC1 PC2 PC3 PC4
## 1 0 0 0 0
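The same identity can be verified for all 150 rows at once by sweeping out the centers with `scale()`:

```r
# center all rows and rotate: should reproduce the score matrix exactly
scores <- scale(iris0, center = iris.pca$center, scale = FALSE) %*% iris.pca$rotation
max(abs(scores - iris.pca$x))  # numerically zero
```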
The first two entries of each row of iris.pca$x are plotted in the biplot. The plot symbol is simply the row number. We can check this by looking at the coordinates of a few rows.
iris.pca$x[42,1:2]
## PC1 PC2
## -2.8493687 0.9409606
iris.pca$x[16,1:2]
## PC1 PC2
## -2.386039 -1.338062
Make a plot of the first and second components of all observations, colored by species.
plot(iris.pca$x[,1],iris.pca$x[,2],col = rep(2:4, each = 50), lwd = 3)
We then repeated all this with scaling turned on, i.e.
iris.pca1 = prcomp(iris0,scale=T)
plot(iris.pca1$x[,1],iris.pca1$x[,2],col = rep(2:4, each = 50), lwd = 3)
# define a couple of useful functions
set.seed(123)
imagegray = function(A){image(A, col = gray((0:255)/256))} # plot a grayscale image
mytrunc = function(x, top = 1, bot = 0){pmax(pmin(x,top), bot)} # truncate entries to [bot, top]
Make sure that the JPEG files melencolia.JPG and melencoliaDetail.JPG are in the same directory as this .Rmd file.
Read a JPEG file, turn it into a grayscale file, rearrange rows so it can be plotted, and plot the image.
A = readJPEG("melencolia.JPG")
dim(A)
## [1] 1011 800 3
A.gray = .2989*A[,,1] + .5870*A[,,2] + .1140*A[,,3]
A.gray = t(A.gray[seq(dim(A.gray)[1],1,by = -1),])
A.gray[1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0.9999 0.9175553 0.9256757 0.9256757
## [2,] 0 0.9999 0.9175553 0.9256757 0.9256757
## [3,] 0 0.9999 0.9175553 0.9256757 0.9256757
## [4,] 0 0.9999 0.9175553 0.9256757 0.9256757
## [5,] 0 0.9999 0.9175553 0.9256757 0.9256757
imagegray(A.gray)
More on this engraving from 1514 may be found at https://en.wikipedia.org/wiki/Melencolia_I.
The grayscale matrix has all values between 0 and 1. Entries closer to 0 are plotted as darker pixels.
Compute the singular value decomposition of the matrix and plot the logarithms of the singular values.
A.svd = svd(A.gray)
names(A.svd)
## [1] "d" "u" "v"
plot(log10(A.svd$d), main = "Log singular values")
grid(col = 2)
Approximate the grayscale matrix with a rank 30 approximation, obtained from the svd.
k = 30
A.approx = A.svd$u[,1:k]%*%diag(A.svd$d[1:k])%*%t(A.svd$v[,1:k])
imagegray(A.approx+1/2)
The imagegray() function rescales the entries of the matrix so that they span the gray color range and then plots a grayscale image. As it turns out, the approximating matrix has many entries \(>1\). The overall effect is that contrast is reduced. We can restore contrast by truncating the approximating matrix with mytrunc() and then plotting it. We can also plot the difference between the original matrix and the approximation. This shows that the approximation leaves out a lot of detail, in particular small-scale features and edges.
imagegray(mytrunc(A.approx))
imagegray(A.gray-A.approx)
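How good is a rank-k approximation? By the Eckart–Young theorem, the squared Frobenius-norm error of the best rank-k approximation equals the sum of the squared discarded singular values, so we can measure the fraction of the matrix's "energy" captured without even forming the approximation (a quick diagnostic, not in the original notes):

```r
# fraction of the squared Frobenius norm captured by the top k singular values
frac_captured <- function(d, k) sum(d[1:k]^2) / sum(d^2)
frac_captured(A.svd$d, 30)
frac_captured(A.svd$d, 100)
```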
Redo this with a rank 100 approximation. A lot more detail is now visible.
k = 100
A.approx = A.svd$u[,1:k]%*%diag(A.svd$d[1:k])%*%t(A.svd$v[,1:k])
imagegray(mytrunc(A.approx))
Now work with an image detail (the magic square). A rank 2 approximation reproduces the boxes quite faithfully. A rank 50 approximation allows us to read all figures in the magic square and shows details of the wing, but fine detail (the diagonal hash marks in the square) is largely left out.
A = readJPEG("melencoliaDetail.JPG")
A.gray = .2989*A[,,1] + .5870*A[,,2] + .1140*A[,,3]
A.gray = t(A.gray[seq(dim(A.gray)[1],1,by = -1),])
imagegray(A.gray)
A.svd = svd(A.gray)
k = 2
A.approx = A.svd$u[,1:k]%*%diag(A.svd$d[1:k])%*%t(A.svd$v[,1:k])
imagegray(mytrunc(A.approx))
k = 50
A.approx = A.svd$u[,1:k]%*%diag(A.svd$d[1:k])%*%t(A.svd$v[,1:k])
imagegray(mytrunc(A.approx))
Take a grayscale matrix and delete 50% of all entries. Now the figures are difficult to read.
A.inc = A.gray
pixels = dim(A.inc)[1]*dim(A.inc)[2]
deleteratio = .5
incomplete = sample(pixels, deleteratio*pixels, replace = F)
A.inc[incomplete] <- NA
imagegray(A.inc)
Use the softImpute() function to fill in the missing entries. Here, we assume that the filled-in matrix has rank at most 100.
A.completion = softImpute(A.inc, rank.max = 100)
names(A.completion)
## [1] "u" "d" "v"
A.comp = A.completion$u%*%diag(A.completion$d)%*%t(A.completion$v)
imagegray(mytrunc(A.comp))
The figures are readable again, and even some of the diagonal stripes are visible.
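To put a number on this (an added check, not part of the original analysis), we can compare the imputed values to the true pixels on exactly the entries that were deleted:

```r
# root-mean-squared error of the imputation on the deleted entries
sqrt(mean((A.comp[incomplete] - A.gray[incomplete])^2))
```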
Now delete a systematic portion of the image. The algorithm is not capable of recovering the lost detail.
A.inc.1 = A.gray
A.inc.1[450:500,50:100] <- NA
A.completion = softImpute(A.inc.1, rank.max = 200)
A.comp.1 = A.completion$u%*%diag(A.completion$d)%*%t(A.completion$v)
imagegray(mytrunc(A.comp.1))
This topic appeared in Chapter 6, but it makes more sense to discuss it now.
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## -Andy Allanson 293 66 1 30 29 14 1 293 66
## -Alan Ashby 315 81 7 24 38 39 14 3449 835
## -Alvin Davis 479 130 18 66 72 76 3 1624 457
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575
## -Andres Galarraga 321 87 10 39 42 30 2 396 101
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133
## CHmRun CRuns CRBI CWalks League Division PutOuts Assists
## -Andy Allanson 1 30 29 14 A E 446 33
## -Alan Ashby 69 321 414 375 N W 632 43
## -Alvin Davis 63 224 266 263 A W 880 82
## -Andre Dawson 225 828 838 354 N E 200 11
## -Andres Galarraga 12 48 46 33 N E 805 40
## -Alfredo Griffin 19 501 336 194 A W 282 421
## Errors Salary NewLeague
## -Andy Allanson 20 NA A
## -Alan Ashby 10 475.0 N
## -Alvin Davis 14 480.0 A
## -Andre Dawson 3 500.0 N
## -Andres Galarraga 4 91.5 N
## -Alfredo Griffin 25 750.0 A
Let’s focus on just the first several numeric columns, and we’ll try to predict Salary from the rest, as we’ve done before.
hitters_df <- Hitters[,c(1:9,19)]
head(hitters_df)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## -Andy Allanson 293 66 1 30 29 14 1 293 66
## -Alan Ashby 315 81 7 24 38 39 14 3449 835
## -Alvin Davis 479 130 18 66 72 76 3 1624 457
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575
## -Andres Galarraga 321 87 10 39 42 30 2 396 101
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133
## Salary
## -Andy Allanson NA
## -Alan Ashby 475.0
## -Alvin Davis 480.0
## -Andre Dawson 500.0
## -Andres Galarraga 91.5
## -Alfredo Griffin 750.0
Notice that many of these predictors are highly correlated. What is the problem with having correlated predictors?
cor(hitters_df[,1:9])
## AtBat Hits HmRun Runs RBI Walks
## AtBat 1.00000000 0.96793882 0.5921985 0.913060329 0.8205392 0.6698451
## Hits 0.96793882 1.00000000 0.5621579 0.922187191 0.8110732 0.6412106
## HmRun 0.59219847 0.56215789 1.0000000 0.650988186 0.8551222 0.4810143
## Runs 0.91306033 0.92218719 0.6509882 1.000000000 0.7982057 0.7322128
## RBI 0.82053923 0.81107323 0.8551222 0.798205666 1.0000000 0.6159971
## Walks 0.66984506 0.64121060 0.4810143 0.732212848 0.6159971 1.0000000
## Years 0.04737174 0.04476656 0.1163183 0.004541271 0.1461681 0.1364750
## CAtBat 0.23552636 0.22756487 0.2218821 0.186497388 0.2946884 0.2771748
## CHits 0.25271704 0.25581487 0.2206266 0.204829712 0.3082011 0.2806711
## Years CAtBat CHits
## AtBat 0.047371739 0.2355264 0.2527170
## Hits 0.044766557 0.2275649 0.2558149
## HmRun 0.116318335 0.2218821 0.2206266
## Runs 0.004541271 0.1864974 0.2048297
## RBI 0.146168128 0.2946884 0.3082011
## Walks 0.136474971 0.2771748 0.2806711
## Years 1.000000000 0.9202894 0.9036311
## CAtBat 0.920289361 1.0000000 0.9950635
## CHits 0.903631057 0.9950635 1.0000000
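Correlated predictors make the least-squares problem ill-conditioned: coefficient estimates become unstable and their variances inflate. One quick diagnostic (our addition, not required for PCR) is the condition number of the predictor correlation matrix; values far above 1 signal near-collinearity.

```r
# condition number of the predictor correlation matrix;
# large values indicate near-collinearity among the predictors
kappa(cor(hitters_df[, 1:9]), exact = TRUE)
```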
We’ll use Principal Components Regression to first transform the predictors into a new set of orthogonal predictors.
pcr.fit=pcr(Salary ~ ., data=Hitters ,scale=TRUE, validation ="CV")
summary(pcr.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 350.4 351.6 352.2 350.6 347.0 344.1
## adjCV 452 350.2 351.2 351.7 350.0 346.4 343.3
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 348.5 350.6 352.7 352.0 352.7 354.9 355.3
## adjCV 347.5 349.4 351.4 350.6 351.3 353.4 353.7
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 349.3 350.4 341.0 339.8 339.8 342.7
## adjCV 347.5 348.6 339.3 338.0 337.8 340.6
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 94.96 96.28 97.26 97.98 98.65 99.15 99.47
## Salary 46.75 46.86 47.76 47.82 47.85 48.10 50.40
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.75 99.89 99.97 99.99 100.00
## Salary 50.55 53.01 53.85 54.61 54.61
names(pcr.fit)
## [1] "coefficients" "scores" "loadings" "Yloadings"
## [5] "projection" "Xmeans" "Ymeans" "fitted.values"
## [9] "residuals" "Xvar" "Xtotvar" "fit.time"
## [13] "na.action" "ncomp" "method" "scale"
## [17] "validation" "call" "terms" "model"
# we performed cross-validation to help select the number of principal components to keep
plot(MSEP(pcr.fit))
# other helpful visualizations
plot(pcr.fit,ncomp=3)
plot(pcr.fit, "scores") # plot the data points in the first two PCs
plot(pcr.fit, "loadings")
Using the College data, we’ll apply Principal Component Regression to try to predict the variable “Accept”.
Get a baseline for comparison - Use a linear model with all the features included and try to predict the “Accept” feature.
Use ‘pcr()’ to predict the variable Accept. From the summary() output, approximately how many principal components are needed to explain 90% of the variance of ‘Accept’?
Visualize the MSEP versus number of components included in the model.